iT邦幫忙

12th iThome Ironman Contest

DAY 19
AI & Data

Machine Learning series, part 19

Day19 - Feature Engineering -- 7. Date and Time Engineering (1)


7. Date and Time Engineering

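Many datasets contain date and time features. They are important columns, and handling them well can help a machine learning model learn faster and make more accurate predictions.

The numbers that make up a date and time each correspond to a specific part of it, which makes them a good source of information. Which features to derive from a datetime variable depends entirely on the project.

We will use pandas to extract useful features from a datetime column. First, load the New York City Taxi Fare Prediction dataset from Kaggle.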

import numpy as np
import pandas as pd

# Read only the first 2,000 rows and parse pickup_datetime as a datetime column
df_train = pd.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', nrows=2_000, parse_dates=["pickup_datetime"])

df_train.dtypes
key object
fare_amount float64
pickup_datetime datetime64[ns, UTC]
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object
df_train['pickup_datetime'].head()
0 2009-06-15 17:26:21+00:00
1 2010-01-05 16:52:16+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:42+00:00
4 2010-03-09 07:51:00+00:00
Name: pickup_datetime, dtype: datetime64[ns, UTC]

When we have a **datetime** variable, we can extract the following information:

Extracting the date feature

df_train['pickup_date'] = df_train['pickup_datetime'].dt.date
df_train[['pickup_datetime','pickup_date']].head()

/ | pickup_datetime | pickup_date
------------- | ------------- | -------------
0 | 2009-06-15 17:26:21+00:00 | 2009-06-15
1 | 2010-01-05 16:52:16+00:00 | 2010-01-05
2 | 2011-08-18 00:35:00+00:00 | 2011-08-18
3 | 2012-04-21 04:30:42+00:00 | 2012-04-21
4 | 2010-03-09 07:51:00+00:00 | 2010-03-09
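
One detail worth knowing about the step above: `.dt.date` returns plain Python `date` objects, so the new column has dtype `object`. The sketch below, built on a small hypothetical `sample` frame that reuses the first three pickup times, compares it with `.dt.normalize()`, which resets the time to midnight but keeps the datetime dtype. Which form is more convenient depends on what you do with the column afterwards.

```python
import pandas as pd

# Hypothetical sample: the first three pickup times from the output above.
sample = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2009-06-15 17:26:21", "2010-01-05 16:52:16", "2011-08-18 00:35:00"],
        utc=True,
    )
})

# .dt.date yields Python date objects, so the column dtype is object.
sample["pickup_date"] = sample["pickup_datetime"].dt.date

# .dt.normalize() sets the time part to midnight and keeps a datetime dtype,
# so the .dt accessor still works on the result.
sample["pickup_day_start"] = sample["pickup_datetime"].dt.normalize()

print(sample.dtypes)
```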

Extracting the year, month, and day-of-month features

df_train['pickup_year'] = df_train['pickup_datetime'].dt.year
df_train['pickup_month'] = df_train['pickup_datetime'].dt.month
df_train['pickup_day'] = df_train['pickup_datetime'].dt.day
df_train[['pickup_datetime','pickup_year','pickup_month','pickup_day']].head()

/ | pickup_datetime | pickup_year | pickup_month | pickup_day
------------- | ------------- | ------------- | ------------- | -------------
0 | 2009-06-15 17:26:21+00:00 | 2009 | 6 | 15
1 | 2010-01-05 16:52:16+00:00 | 2010 | 1 | 5
2 | 2011-08-18 00:35:00+00:00 | 2011 | 8 | 18
3 | 2012-04-21 04:30:42+00:00 | 2012 | 4 | 21
4 | 2010-03-09 07:51:00+00:00 | 2010 | 3 | 9

Extracting week-related features

# Day of the week (0 = Monday, 6 = Sunday)
df_train['pickup_dayofweek'] = df_train['pickup_datetime'].dt.dayofweek
# Is it a weekend? (1 or 0)
df_train['is_weekend'] = np.where(df_train['pickup_dayofweek'].isin([5, 6]), 1, 0)
# Week of the year (ISO 8601 week number)
df_train['pickup_week'] = df_train['pickup_datetime'].dt.isocalendar().week
df_train[['pickup_datetime','pickup_dayofweek','is_weekend','pickup_week']].head()

/ | pickup_datetime | pickup_dayofweek | is_weekend | pickup_week
------------- | ------------- | ------------- | ------------- | -------------
0 | 2009-06-15 17:26:21+00:00 | 0 | 0 | 25
1 | 2010-01-05 16:52:16+00:00 | 1 | 0 | 1
2 | 2011-08-18 00:35:00+00:00 | 3 | 0 | 33
3 | 2012-04-21 04:30:42+00:00 | 5 | 1 | 16
4 | 2010-03-09 07:51:00+00:00 | 1 | 0 | 10
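
Because `isocalendar()` follows ISO 8601 week numbering, dates near New Year can belong to a week of the neighbouring year. The sketch below uses three hypothetical dates (not taken from the dataset) to show the ISO year and week next to the calendar year. If the week number is used as a feature, it may be worth keeping the ISO year alongside it.

```python
import pandas as pd

# Hypothetical dates chosen to sit on a year boundary.
dates = pd.Series(pd.to_datetime(["2009-12-31", "2010-01-01", "2010-01-04"]))

# isocalendar() returns a DataFrame with the columns year, week and day.
iso = dates.dt.isocalendar()

boundary = pd.DataFrame({
    "date": dates,
    "calendar_year": dates.dt.year,
    "iso_year": iso["year"],
    "iso_week": iso["week"],
})

# 2010-01-01 is a Friday, so it still belongs to the last ISO week of 2009.
print(boundary)
```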

Extracting year-related features

# Quarter of the year (1 to 4)
df_train['pickup_quarter'] = df_train['pickup_datetime'].dt.quarter
# First or second half of the year (1 or 2)
df_train['pickup_semester'] = np.where(df_train['pickup_quarter'].isin([1, 2]), 1, 2)
df_train[['pickup_datetime','pickup_quarter','pickup_semester']].head()

/ | pickup_datetime | pickup_quarter | pickup_semester
------------- | ------------- | ------------- | -------------
0 | 2009-06-15 17:26:21+00:00 | 2 | 1
1 | 2010-01-05 16:52:16+00:00 | 1 | 1
2 | 2011-08-18 00:35:00+00:00 | 3 | 2
3 | 2012-04-21 04:30:42+00:00 | 2 | 1
4 | 2010-03-09 07:51:00+00:00 | 1 | 1
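
The half-year flag can also be computed directly from the month with integer arithmetic. The following is a minimal sketch on hypothetical data (one date per month of 2010) that checks the two approaches agree. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one date in the middle of each month of 2010.
dates = pd.Series(pd.to_datetime([f"2010-{m:02d}-15" for m in range(1, 13)]))

# Approach used in the article: derive the half-year from the quarter.
semester_from_quarter = np.where(dates.dt.quarter.isin([1, 2]), 1, 2)

# Alternative: integer arithmetic on the month (1-6 -> 1, 7-12 -> 2).
semester_from_month = (dates.dt.month - 1) // 6 + 1

# Both approaches should give the same label for every month.
print((semester_from_quarter == semester_from_month).all())
```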

Extracting the time feature

df_train['pickup_time'] = df_train['pickup_datetime'].dt.time
df_train[['pickup_datetime','pickup_time']].head()

/ | pickup_datetime | pickup_time
------------- | ------------- | -------------
0 | 2009-06-15 17:26:21+00:00 | 17:26:21
1 | 2010-01-05 16:52:16+00:00 | 16:52:16
2 | 2011-08-18 00:35:00+00:00 | 00:35:00
3 | 2012-04-21 04:30:42+00:00 | 04:30:42
4 | 2010-03-09 07:51:00+00:00 | 07:51:00
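
Like `.dt.date`, `.dt.time` produces Python objects (`datetime.time`), which many models cannot consume directly. One possible numeric alternative, shown here as a sketch and not as part of the original workflow, is the number of seconds since midnight. The `seconds_since_midnight` name is made up for this example.

```python
import pandas as pd

# Two pickup times taken from the sample rows above.
pickups = pd.Series(
    pd.to_datetime(["2009-06-15 17:26:21", "2011-08-18 00:35:00"], utc=True)
)

# Combine hour, minute and second into a single numeric value.
seconds_since_midnight = (
    pickups.dt.hour * 3600 + pickups.dt.minute * 60 + pickups.dt.second
)

print(seconds_since_midnight)
```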

Extracting the hour, minute, and second features

df_train['pickup_hour'] = df_train['pickup_datetime'].dt.hour
df_train['pickup_minute'] = df_train['pickup_datetime'].dt.minute
df_train['pickup_second'] = df_train['pickup_datetime'].dt.second
df_train[['pickup_datetime','pickup_hour','pickup_minute','pickup_second']].head()

/ | pickup_datetime | pickup_hour | pickup_minute | pickup_second
------------- | ------------- | ------------- | ------------- | -------------
0 | 2009-06-15 17:26:21+00:00 | 17 | 26 | 21
1 | 2010-01-05 16:52:16+00:00 | 16 | 52 | 16
2 | 2011-08-18 00:35:00+00:00 | 0 | 35 | 0
3 | 2012-04-21 04:30:42+00:00 | 4 | 30 | 42
4 | 2010-03-09 07:51:00+00:00 | 7 | 51 | 0
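
Note that the column was parsed as UTC, so the hours above are UTC hours. If local time of day matters for the project, the timestamps can be converted before the hour is extracted. The sketch below assumes `America/New_York` is the relevant zone for this dataset; that is an assumption made for illustration, not something the article states.

```python
import pandas as pd

# The first two pickup times from the sample above, stored in UTC.
pickups = pd.Series(
    pd.to_datetime(["2009-06-15 17:26:21", "2010-01-05 16:52:16"], utc=True)
)

# Assumption: New York local time is the zone of interest.
local = pickups.dt.tz_convert("America/New_York")

hours = pd.DataFrame({
    "hour_utc": pickups.dt.hour,
    "hour_local": local.dt.hour,
})

# The offset differs between the two rows because of daylight saving time.
print(hours)
```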

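Extracting whether the pickup falls in business hours and whether it is in the morning

# Is it within business hours? (defined here as 8 AM up to, but not including, 12 PM noon) (1 or 0)
df_train['pickup_business'] = np.where(df_train['pickup_hour'].isin([8, 9, 10, 11]), 1, 0)
# Is it in the morning? (hours 7 through 11) (1 or 0)
df_train['pickup_is_morning'] = np.where((df_train['pickup_hour']<12) & (df_train['pickup_hour']>6), 1, 0)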
df_train[['pickup_datetime','pickup_business','pickup_is_morning']].head()

/ | pickup_datetime | pickup_business | pickup_is_morning
------------- | ------------- | ------------- | -------------
0 | 2009-06-15 17:26:21+00:00 | 0 | 0
1 | 2010-01-05 16:52:16+00:00 | 0 | 0
2 | 2011-08-18 00:35:00+00:00 | 0 | 0
3 | 2012-04-21 04:30:42+00:00 | 0 | 0
4 | 2010-03-09 07:51:00+00:00 | 0 | 1
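
The same two flags can be written with `Series.between`, which is inclusive on both ends by default and may read more clearly for contiguous hour ranges. This is a sketch using the pickup hours of the five sample rows; the hour ranges mirror the definitions used above.

```python
import pandas as pd

# Pickup hours of the five sample rows shown above.
hours = pd.Series([17, 16, 0, 4, 7], name="pickup_hour")

# Hours 8 through 11, equivalent to isin([8, 9, 10, 11]).
pickup_business = hours.between(8, 11).astype(int)

# Hours 7 through 11, equivalent to (hour > 6) & (hour < 12).
pickup_is_morning = hours.between(7, 11).astype(int)

print(pd.DataFrame({
    "pickup_hour": hours,
    "pickup_business": pickup_business,
    "pickup_is_morning": pickup_is_morning,
}))
```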

